Include the names of other students you worked with here.
You will work with 3 data sets in this assignment. Two of the data sets will be explored in both R and Python, which allows you to directly compare figures between the two languages. Two of the data sets are new, while the third is a data set from the previous assignment. You will practice lumping low frequency counts in this assignment and consider the consequences of such actions. You will also practice visualizing continuous distributions using synthetic and real data.
The three data sets for this assignment are the WPIAL Baseball Champions data, synthetic data generated for this assignment, and the EPA AQS data (focusing on its daily_avg variable). You will explore the WPIAL and synthetic data in both R and Python. The EPA AQS data will only be explored in R.
The last problem, Problem 06, requires you to read in potential data for your final project. You may answer this problem in either Python or R.
Important: Please download all CSV files from the Canvas site and save them in the same directory as your RMarkdown file. The code chunks read in the CSV files assuming they are in the same directory.
Load or import the tidyverse in the code chunk
below.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The WPIAL is the high school athletic league in Western Pennsylvania. Each year, high schools across Western PA compete to win the Western PA title and move on to the state championship. For baseball, the WPIAL is divided into multiple classes based on high school district size. Schools with fewer students compete in Class 1A while the largest schools compete in 6A. The WPIAL Baseball Champions data are read in for you in the code chunk below. The data are assigned to the wpial object. There are three columns in the data set:
Year: The year the competition was played
Class: The Class designation (school size)
School: The name of the school that won the championship
The data you are working with do not consist of all WPIAL winners. You are working with data starting from 1979 even though the first WPIAL Baseball championship took place in 1914. You are working with more recent data because school districts were very different over 100 years ago.
wpial <- readr::read_csv('wpial_baseball_champs.csv', col_names = TRUE)
## Rows: 142 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Class, School
## dbl (1): Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use glimpse() to check the number of rows,
columns, and data types of the columns in
wpial.
Add your code chunk here.
wpial %>% glimpse()
## Rows: 142
## Columns: 3
## $ Year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2022, 2022, 2022, 2022, 202…
## $ Class <chr> "6A", "5A", "4A", "3A", "2A", "1A", "6A", "5A", "4A", "3A", "2A…
## $ School <chr> "Mt. Lebanon", "Shaler", "Hopewell", "Riverside", "Seton LaSall…
Create a bar chart to show the counts for the
Class variable.
Add your code chunk here.
wpial %>%
ggplot(mapping = aes(x = Class)) +
geom_bar() +
theme_bw()
Which Class has the most rows?
Class 2A
Which Class has the fewest rows?
Class 5A and Class 6A (tied)
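geom_bar() tallies the rows for each level internally; the same counts can be computed explicitly with count() and drawn with geom_col(). A minimal sketch on a toy tibble (the grp column is made up for illustration):

```r
library(tidyverse)

# toy data standing in for a categorical column like Class
toy <- tibble(grp = c("1A", "1A", "2A", "2A", "2A", "3A"))

# count() collapses to one row per level with an `n` column: 1A -> 2, 2A -> 3, 3A -> 1
toy_counts <- toy %>% count(grp)

# geom_bar() on the raw rows and geom_col() on the pre-computed counts give the same figure
toy %>% ggplot(mapping = aes(x = grp)) + geom_bar() + theme_bw()
toy_counts %>% ggplot(mapping = aes(x = grp, y = n)) + geom_col() + theme_bw()
```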
Create a bar chart to show the counts for the
School variable. The names should be relatively easy to
read.
wpial %>%
ggplot(mapping = aes(y = School)) +
geom_bar() +
theme_bw()
### 1e)
Create a bar chart to show the counts for the
School variable again. However, this time the bar chart
MUST be in ASCENDING frequency order. The names should be relatively
easy to read.
Add your code chunks here.
wpial %>%
mutate(School = fct_rev(fct_infreq(School))) %>%
ggplot(mapping = aes(y = School)) +
geom_bar() +
theme_bw()
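The ascending order comes from reordering factor levels rather than sorting rows: fct_infreq() places the most frequent level first, and fct_rev() flips the order so the least frequent level sits at the bottom of a horizontal bar chart. A sketch on toy data:

```r
library(tidyverse)

f <- factor(c("a", "b", "b", "c", "c", "c"))

levels(fct_infreq(f))            # "c" "b" "a" (most frequent first)
levels(fct_rev(fct_infreq(f)))   # "a" "b" "c" (least frequent first)
```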
Let’s now consider the combination of School and
Class.
Create a STACKED bar chart to show the counts of the
School variable. The bar fill must be
associated with the Class variable. The bar chart must be
in ASCENDING frequency order. The names should be relatively easy to
read.
Use the color blind friendly color palette.
Question
wpial %>%
mutate(School = fct_rev(fct_infreq(School))) %>%
ggplot(mapping = aes(y = School)) +
geom_bar(mapping = aes(fill = Class)) +
ggthemes::scale_fill_colorblind() +
theme_bw()
Let’s now consider the combinations of School and
Class using a heat map.
Create a heat map to show the combinations of
School and Class. The heat map
fill must be associated with the count using a sequential
color scale. The count must also be annotated via text and the text
color must depend on the count.
Question
wpial %>%
count(School, Class) %>%
ggplot(mapping = aes(x = Class, y = School)) +
geom_tile(mapping = aes(fill = n),
color = 'black') +
# text color switches at the median count so labels stay readable on the fill
geom_text(mapping = aes(label = n,
color = n > median(n)),
show.legend = FALSE) +
scale_fill_viridis_c() +
scale_color_manual(values = c('TRUE' = 'black', 'FALSE' = 'white')) +
theme_bw()
How many different Class championships have been
won by the School with the most total WPIAL
championships?
What do you think? The chunks below identify the School with the most total WPIAL championships; it has won 8 titles in total.
wpial %>%
mutate(School = fct_infreq(School)) %>%
ggplot(mapping = aes(y = School)) +
geom_bar() +
theme_bw()
wpial %>%
group_by(School) %>%
summarise(total_championships = n()) %>%
filter(total_championships == max(total_championships)) %>%
pull(School)
## [1] "North Allegheny"
wpial_counts <- wpial %>%
count(School, Class)
How many schools have won a single WPIAL championship?
Add your code chunks here if you want to.
# NOTE: counting by School AND Class lists a school once per class it won in,
# so a school with one title in each of two classes appears twice below. To
# count schools with exactly one championship overall, count by School alone:
# wpial %>% count(School) %>% filter(n == 1) %>% nrow()
wpial %>%
count(School, Class) %>%
filter(n == 1) %>%
pull(School)
## [1] "Avonworth" "Beaver" "Beaver"
## [4] "Beaver Falls" "Bishop Canevin" "Bishop Canevin"
## [7] "Blackhawk" "Blackhawk" "Blackhawk"
## [10] "Brownsville" "Canevin Catholic" "Canon-McMillan"
## [13] "Central Catholic" "Chartiers Valley" "Derry"
## [16] "Elizabeth Forward" "Fort Cherry" "Franklin Regional"
## [19] "Freeport" "Greater Latrobe" "Greensburg C.C."
## [22] "Hopewell" "Jeannette" "Knoch"
## [25] "Moon" "Mt. Lebanon" "New Castle"
## [28] "North Allegheny" "North Catholic" "North Hills"
## [31] "Northgate" "Norwin" "Pine-Richland"
## [34] "Quaker Valley" "Seneca Valley" "Serra Catholic"
## [37] "Shady Side Acad." "Shenango" "South Fayette"
## [40] "South Fayette" "Springdale" "Steel Valley"
## [43] "Steel Valley" "Sto-Rox" "Swissvale"
## [46] "Upper St. Clair" "Vincentian Acad." "Waynesburg Cent."
## [49] "West Allegheny" "West Allegheny" "West Mifflin"
## [52] "Wilmington"
Hopefully you have seen that there are many different schools in the data set! We learned that LUMPING low frequency categories can help simplify figures when there are a large number of categories. However, it is critical to consider the consequences of such actions! Let’s see what those consequences could be with the WPIAL data set.
There are multiple lumping procedures that you could consider, but let’s use a simple one in this assignment. You will lump based on a PROPORTION threshold. All LEVELS or CATEGORIES with LESS THAN a THRESHOLD proportion will be LUMPED into an OTHER category. In lecture, we used a proportion threshold of 5%. However, you will use a proportion threshold of 1% in this assignment.
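forcats provides fct_lump_prop() for exactly this procedure. A sketch of its behavior on toy data (the 15% threshold here is only for illustration; the assignment uses 1%):

```r
library(tidyverse)

f <- factor(c(rep("big", 8), "rare1", "rare2"))

# levels appearing less than 15% of the time are collapsed into '(OTHER)'
lumped <- fct_lump_prop(f, prop = 0.15, other_level = "(OTHER)")
table(lumped)   # big: 8, (OTHER): 2
```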
Lump the low frequency categories of School into
an OTHER category. Name the OTHER category '(OTHER)'.
Visualize the counts of the remaining categories via a bar chart. The
bar chart must be in ASCENDING frequency order. The names should be
relatively easy to read.
Add your code chunks here.
wpial %>%
mutate(School = fct_lump_prop(School, prop = 0.01, other_level = '(OTHER)')) %>%
mutate(School = fct_infreq(School)) %>%
mutate(School = fct_rev(School)) %>%
ggplot(mapping = aes(y = School)) +
geom_bar() +
theme_bw()
wpial %>%
mutate(School_lump = fct_lump_prop(School, prop = 0.01, other_level = '(other)')) %>%
mutate(School_factor = as.factor(School))%>%
summary()
## Year Class School School_lump
## Min. :1979 Length:142 Length:142 (other) :31
## 1st Qu.:1996 Class :character Class :character North Allegheny: 8
## Median :2008 Mode :character Mode :character California : 6
## Mean :2006 Neshannock : 6
## 3rd Qu.:2017 Pine-Richland : 6
## Max. :2023 Riverside : 6
## (Other) :79
## School_factor
## North Allegheny: 8
## California : 6
## Neshannock : 6
## Pine-Richland : 6
## Riverside : 6
## Hopewell : 5
## (Other) :105
You will now visualize the combinations of the lumped
School with Class via a heat map.
Lump the low frequency categories of School into
an OTHER category. Name the OTHER category '(OTHER)'.
Visualize the counts of the combinations of the remaining
School categories and Class via a heat map.
The heat map fill must be associated with the count using a
sequential color scale. The count must also be annotated via text and
the text color must depend on the count.
You must continue to use a PROPORTION threshold of 1%.
Add your code chunks here.
wpial %>%
mutate(School = fct_lump_prop(School, prop = 0.01, other_level = '(OTHER)')) %>%
count(Class, School) %>%
ggplot(mapping = aes(x = Class, y = School)) +
geom_tile(mapping = aes(fill = n),
color = 'black') +
# text color switches at the median count so labels stay readable on the fill
geom_text(mapping = aes(label = n,
color = n > median(n)),
show.legend = FALSE) +
scale_fill_viridis_c() +
scale_color_manual(values = c('TRUE' = 'black', 'FALSE' = 'white')) +
theme_bw()
How would you describe the impact of the LUMPING procedure?
Did it help? Did it modify the organizational nature of the
School categories?
The lumping procedure in this case grouped the low-frequency categories of the School variable into an “other” category. This helps by reducing the number of distinct categories and potentially simplifying the analysis. It can also help in handling rare or less significant categories that may not provide meaningful insights individually.
You executed the lumping within this assignment based on a proportion threshold dictated to you. In practice, you should always consider the category proportions BEFORE using an assumed threshold.
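Computing the category proportions before committing to a threshold is a short count() pipeline. A sketch on a toy tibble (the School values here are hypothetical):

```r
library(tidyverse)

toy <- tibble(School = c("A", "A", "A", "B", "C"))

# proportions reveal where a threshold would cut: B and C are each 20% of the rows
toy %>%
  count(School, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
```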
Visualize the proportion bar chart for School
without lumping. The bar chart must be in ASCENDING order. The names
should be relatively easy to read.
wpial %>%
mutate(School = fct_infreq(School),
School = fct_rev(School)) %>%
ggplot(mapping = aes(y =School)) +
geom_bar() +
theme_bw()
wpial %>%
mutate(School = fct_infreq(School),
School = fct_rev(School)) %>%
ggplot(mapping = aes(y =School)) +
geom_bar(mapping = aes(x= after_stat(count/sum(count) ))) +
theme_bw()
Why does the figure in 2d) explain what happened with the LUMPING procedure?
What do you think? The proportion bar chart shows that most schools each account for less than 1% of the championships: a single title out of 142 rows is roughly 0.7% of the data. Every school below the 1% proportion threshold was collapsed, which is why so many categories ended up in the OTHER group.
You will now explore continuous variable distributions. You will do
so using synthetic data generated for this assignment. The synthetic
data are read in for you in the code chunk below and assigned to the
df object. There are 9 continuous variables in the data
set. These 9 variables were randomly generated using distributions that
you will commonly encounter in data analysis tasks. Thus, exploring
these 9 variables will prepare you for common challenges you will
experience when exploring continuous variables.
df <- readr::read_csv('hw03_synthetic.csv', col_names = TRUE)
## Rows: 375 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): x1, x2, x3, x4, x5, x6, x7, x8, x9
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This problem is more open ended compared to others. You must create
the following visualizations for EACH variable within
df.
df %>% glimpse()
## Rows: 375
## Columns: 9
## $ x1 <dbl> 8.120639, 10.550930, 7.493114, 14.785842, 10.988523, 7.538595, 11.4…
## $ x2 <dbl> 9.960223, 14.826853, 10.615168, 5.542776, 9.304117, 6.307572, 9.218…
## $ x3 <dbl> 9.047559, 13.590404, 27.412923, 13.446587, 15.069754, 8.629539, 11.…
## $ x4 <dbl> 1.505284, 2.376810, 1.754423, 7.392501, 1.702202, 0.974089, 20.4096…
## $ x5 <dbl> 2.743911439, 0.517339097, 1.352082323, 1.390951409, 0.693554528, 1.…
## $ x6 <dbl> 2.4843101, 3.0625250, 0.4382848, 1.1873050, 1.3511870, 1.2452908, 0…
## $ x7 <dbl> 0.8941811, 0.8911175, 0.7456002, 0.9214166, 0.9253575, 0.8183267, 0…
## $ x8 <dbl> 2.3458047, 0.4173961, 0.6072066, 1.9577240, -2.5307052, -1.7593410,…
## $ x9 <dbl> 0.08401409, 0.96465840, 0.09453490, 0.09908694, 0.82311863, 0.89395…
df %>% names()
## [1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9"
df %>%
ggplot(mapping = aes(x = x1)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x2)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x3)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x4)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x5)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x6)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x7)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x8)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x9)) +
geom_histogram(bins = 25) +
theme_bw()
df %>%
ggplot(mapping = aes(x = x1)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x2)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x3)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x4)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x5)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x6)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x7)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x8)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
df %>%
ggplot(mapping = aes(x = x9)) +
geom_histogram(bins = 25,
fill = 'dodgerblue',
mapping = aes(y = after_stat(density))) +
geom_density(linewidth = 1.2, color = 'black') +
geom_rug() +
theme_bw()
quantile() as vertical reference lines
df %>%
ggplot(mapping = aes(x = x1)) +
geom_vline(xintercept = quantile( df$x1 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x2)) +
geom_vline(xintercept = quantile( df$x2 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x3)) +
geom_vline(xintercept = quantile( df$x3 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x4)) +
geom_vline(xintercept = quantile( df$x4 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x5)) +
geom_vline(xintercept = quantile( df$x5 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x6)) +
geom_vline(xintercept = quantile( df$x6 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x7)) +
geom_vline(xintercept = quantile( df$x7 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x8)) +
geom_vline(xintercept = quantile( df$x8 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x9)) +
geom_vline(xintercept = quantile( df$x9 ),
color = 'blue', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
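The dashed reference lines come from base R's quantile(), which by default returns the five-number summary (min, quartiles, max), and stat_ecdf() is the plotted analogue of base ecdf(). A quick sketch of both:

```r
x <- 1:101

# default probabilities are 0%, 25%, 50%, 75%, 100%
quantile(x)   # 1, 26, 51, 76, 101

# ecdf() returns a step FUNCTION: the proportion of values <= its argument
F_x <- ecdf(x)
F_x(51)       # 51/101, roughly 0.505
```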
Quartile proportions as horizontal reference lines
df %>%
ggplot(mapping = aes(x = x1)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x2)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x3)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x4)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x5)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x6)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x7)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x8)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
df %>%
ggplot(mapping = aes(x = x9)) +
geom_hline(yintercept = seq(0, 1, by = 0.25),
color = 'red', linetype = 'dashed', linewidth = 1.) +
stat_ecdf(linewidth = 2) +
labs(y = 'cumulative proportion') +
theme_bw()
You must insert all code chunks. Your solutions must be well
organized and easy to follow, but you may organize the code however you
see fit. You CANNOT use any packages outside of the
tidyverse to create the figures.
You must also answer the following conceptual questions for each variable:
You may use for-loops to aid in the creation of the figures, but that is not required. I would not use for-loops if I were in your shoes! However, it should be OBVIOUS which variable you are visualizing and discussing in your response.
HINT: The Week 03 readings will greatly help with this problem.
Add your code chunks here.
You explored the counts of categorical variables for the EPA AQS data
in the previous assignment. That data set included a continuous
variable, daily_avg, which was not explored. You will
explore that variable in this assignment! You will only explore the EPA
AQS daily_avg variable in R.
The code chunk below reads in the EPA AQS data and assigns the data
to the epa object. The code chunk ASSUMES the EPA AQS
data are contained within the 'Week_02' sub-directory. The
file path includes '../Week_02', where '..' represents moving
up one directory before entering the 'Week_02'
subdirectory. Thus, the code chunk below assumes this RMarkdown is
within a sibling subdirectory of 'Week_02'.
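Relative paths like this can be assembled with file.path() and checked before reading, which gives a clearer error than read_csv() failing partway through a knit. A sketch using the file name from the chunk below:

```r
# '..' moves up one directory; file.path() joins the pieces with '/'
csv_path <- file.path("..", "Week_02", "epa_airdata_small.csv")
csv_path   # "../Week_02/epa_airdata_small.csv"

# fail early with a readable message if the file is not where we assume
if (!file.exists(csv_path)) {
  warning("Expected the EPA CSV at: ", csv_path)
}
```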
epa <- readr::read_csv('../Week_02/epa_airdata_small.csv', col_names = TRUE)
## Rows: 34112 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): state, county, city, site, pollutant, units_of_measure
## dbl (1): daily_avg
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
epa %>% glimpse()
## Rows: 34,112
## Columns: 8
## $ state <chr> "Pennsylvania", "Pennsylvania", "Pennsylvania", "Penn…
## $ county <chr> "Allegheny", "Allegheny", "Allegheny", "Allegheny", "…
## $ city <chr> "Wilkinsburg", "Wilkinsburg", "Wilkinsburg", "Wilkins…
## $ site <chr> "1376", "1376", "1376", "1376", "1376", "1376", "1376…
## $ pollutant <chr> "Carbon monoxide", "Carbon monoxide", "Carbon monoxid…
## $ date <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019…
## $ daily_avg <dbl> 0.391957, 0.400957, 0.341391, 0.744391, 0.464826, 0.4…
## $ units_of_measure <chr> "Parts per million", "Parts per million", "Parts per …
Create the MARGINAL histogram for daily_avg
using an appropriate number of bins based on the sample
size.
Add your code chunks here.
epa %>%
ggplot(mapping = aes(x = daily_avg)) +
geom_histogram(bins = 17) +
theme_bw()
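One defensible way to tie the bin count to the sample size is Sturges' rule, ceiling(log2(n) + 1), which base R exposes as nclass.Sturges(); for the 34,112 EPA rows it gives 17, the value used above. A sketch:

```r
n <- 34112

# Sturges' rule: the number of bins grows with the log of the sample size
ceiling(log2(n) + 1)         # 17

# nclass.Sturges() implements the same rule (it only uses length(x))
nclass.Sturges(numeric(n))   # 17
```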
Although the previous figure seemed like a good place to start, it is incorrect: the marginal histogram makes no sense for this application! Your categorical variable exploration in the previous assignment hinted at why. Let’s find out by creating several conditional histograms in this assignment.
Create the histogram of daily_avg again but this
time FACET based on the pollutant variable. The scales
should be free (or allowed to float) across the wrapped
facets.
Add your code chunks here.
epa %>%
ggplot(mapping = aes(x = daily_avg)) +
geom_histogram(bins = 11) +
facet_wrap( ~pollutant, scales = 'free' ) +
theme_bw()
Facets can be associated with more than one variable. Adding
additional variables to the FACETS is simple using the facet formula
interface. The syntax below shows how facets can be associated with TWO
variables within the facet_wrap() function:
facet_wrap( ~ <variable 1> + <variable 2> )
The <variable 1> and
<variable 2> names serve as placeholders for the
facetting variables of interest. The + reads as “Facet
based on <variable 1> AND <variable 2>”.
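The formula can be checked without rendering the figure: building the ggplot object is enough to confirm that both variables became facetting variables. A sketch on a toy tibble (column names hypothetical):

```r
library(tidyverse)

toy <- tibble(x  = rnorm(20),
              g1 = rep(c("a", "b"), 10),
              g2 = rep(c("u", "v"), each = 10))

p <- toy %>%
  ggplot(mapping = aes(x = x)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ g1 + g2, scales = "free")

# the plot object stores the facet specification; one panel per (g1, g2) combination
class(p$facet)[1]   # "FacetWrap"
```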
Create the histogram of daily_avg again but this
time FACET based on two variables. The facets must be associated with
the pollutant variable AND the
units_of_measure variable. The scales should be free (or
allowed to float) across the wrapped facets.
epa %>%
ggplot(mapping = aes(x = daily_avg)) +
geom_histogram(bins = 11) +
facet_wrap( ~pollutant + units_of_measure , scales = 'free' ) +
theme_bw()
Add your code chunks here.
Now that you have visualized the distribution of
daily_avg conditioned on two important categorical
variables, let’s revisit why the histogram in 4a) makes no sense.
Why was the histogram in 4a) a poor figure to create for this application?
The marginal histogram pools every row together, mixing measurements of
different pollutants recorded in different
units_of_measure. Binning values that live on incompatible
scales produces a meaningless distribution, so daily_avg
must be conditioned on pollutant and
units_of_measure before it can be interpreted.
The EPA AQS data consists of measurements recorded at multiple sites
across cities and counties from Western PA. Thus, it would be
particularly interesting to consider the conditional
distribution of the daily_avg variable
GIVEN the site! This allows you to explore if the
daily_avg distributions are different at
different sites!
There are multiple approaches to visualize conditional distributions. Let’s start with the box plot within this assignment.
Create a boxplot for the daily_avg conditioned
on the city, pollutant, and
units_of_measure. The daily_avg variable must
be mapped to the x aesthetic, the city must be
mapped to the y aesthetic within the PARENT
ggplot() call. The facets must be associated with
pollutant and units_of_measure. You are
therefore creating facetted HORIZONTAL boxplots.
epa %>%
ggplot(mapping = aes(x = daily_avg, y = city)) +
geom_boxplot(fill = 'pink', color = 'purple') +
facet_wrap( ~pollutant + units_of_measure , scales = 'free' ) +
theme_bw()
Boxplots include additional aesthetics that can be associated with
categorical variables. The boxplot fill and
color can be used to represent a variable different from
the variable associated with the main aesthetic (the x
aesthetic for vertical boxplots and the y aesthetic for
horizontal boxplots). This allows you to group by an
additional variable. The boxplot is therefore summarizing the
distribution GIVEN the combination of the main aesthetic and the
fill aesthetic.
Create a boxplot for the daily_avg conditioned
on the city, county, pollutant,
and units_of_measure. The daily_avg variable
must be mapped to the x aesthetic, the city
must be mapped to the y aesthetic within the PARENT
ggplot() call. You must map the county
variable to the fill and color aesthetics
within the boxplot layer. Hard code the boxplot alpha to be
0.33. The facets must be associated with pollutant and
units_of_measure. You are therefore creating facetted
HORIZONTAL boxplots.
You may use the default fill and
color palettes.
# facet_grid() lays the two facetting variables out as rows and columns
epa %>%
ggplot(mapping = aes(x = daily_avg, y = city)) +
geom_boxplot(aes(fill = county, color = county), alpha = 0.33) +
facet_grid(pollutant ~ units_of_measure, scales = "free") +
theme_bw()
# facet_wrap() with the formula interface wraps the combinations instead
epa %>%
ggplot(mapping = aes(x = daily_avg, y = city)) +
geom_boxplot(aes(fill = county, color = county), alpha = 0.33) +
facet_wrap( ~ pollutant + units_of_measure, scales = 'free' ) +
theme_bw()
What changed about the boxplots when you included the
additional county grouping?
What do you think?
When the additional county grouping is included, each city's boxplot is
split into one boxplot per county, distinguished by the fill and color
aesthetics. The boxplots now summarize the distribution of daily_avg for
each (city, county) combination rather than for each city alone. This
reveals whether measurements differ across counties that share a city
name, providing additional insight into the variability within each
city.
Now it’s time to start exploring a data set of your own choosing! Last assignment you considered several potential data sets for your final project. You may use one of those here or you can use a different data set for this problem. It is up to you!
Regardless of your choice of data set, you must perform the following actions:
big_tech_stock_prices <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-07/big_tech_stock_prices.csv')
## Rows: 45088 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): stock_symbol
## dbl (6): open, high, low, close, adj_close, volume
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
big_tech_stock_prices %>% glimpse()
## Rows: 45,088
## Columns: 8
## $ stock_symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "…
## $ date <date> 2010-01-04, 2010-01-05, 2010-01-06, 2010-01-07, 2010-01-…
## $ open <dbl> 7.622500, 7.664286, 7.656429, 7.562500, 7.510714, 7.60000…
## $ high <dbl> 7.660714, 7.699643, 7.686786, 7.571429, 7.571429, 7.60714…
## $ low <dbl> 7.585000, 7.616071, 7.526786, 7.466071, 7.466429, 7.44464…
## $ close <dbl> 7.643214, 7.656429, 7.534643, 7.520714, 7.570714, 7.50392…
## $ adj_close <dbl> 6.515213, 6.526476, 6.422664, 6.410790, 6.453412, 6.39648…
## $ volume <dbl> 493729600, 601904800, 552160000, 477131200, 447610800, 46…
is_numeric <- sapply(big_tech_stock_prices, is.numeric)
print(is_numeric)
## stock_symbol date open high low close
## FALSE FALSE TRUE TRUE TRUE TRUE
## adj_close volume
## TRUE TRUE
str(big_tech_stock_prices)
## spc_tbl_ [45,088 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ stock_symbol: chr [1:45088] "AAPL" "AAPL" "AAPL" "AAPL" ...
## $ date : Date[1:45088], format: "2010-01-04" "2010-01-05" ...
## $ open : num [1:45088] 7.62 7.66 7.66 7.56 7.51 ...
## $ high : num [1:45088] 7.66 7.7 7.69 7.57 7.57 ...
## $ low : num [1:45088] 7.58 7.62 7.53 7.47 7.47 ...
## $ close : num [1:45088] 7.64 7.66 7.53 7.52 7.57 ...
## $ adj_close : num [1:45088] 6.52 6.53 6.42 6.41 6.45 ...
## $ volume : num [1:45088] 4.94e+08 6.02e+08 5.52e+08 4.77e+08 4.48e+08 ...
## - attr(*, "spec")=
## .. cols(
## .. stock_symbol = col_character(),
## .. date = col_date(format = ""),
## .. open = col_double(),
## .. high = col_double(),
## .. low = col_double(),
## .. close = col_double(),
## .. adj_close = col_double(),
## .. volume = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
unique_counts <- sapply(big_tech_stock_prices, function(x) {
if (is.factor(x)) {
nlevels(x)
} else {
length(unique(x))
}
})
print(unique_counts)
## stock_symbol date open high low close
## 14 3287 29798 30056 30028 30965
## adj_close volume
## 41279 43550
missing_counts <- colSums(is.na(big_tech_stock_prices))
print(missing_counts)
## stock_symbol date open high low close
## 0 0 0 0 0 0
## adj_close volume
## 0 0
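The sapply() and colSums() checks above can also be written in tidyverse style with across(), which keeps each summary as a one-row tibble. A sketch on toy data (column names hypothetical):

```r
library(tidyverse)

toy <- tibble(sym   = c("A", "A", "B"),
              price = c(1, 2, NA))

# distinct values per column (NA counts as a distinct value)
toy %>% summarise(across(everything(), n_distinct))

# missing values per column
toy %>% summarise(across(everything(), ~ sum(is.na(.x))))
```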
You must create figures to start exploring the data. However, you do not need to explore ALL variables in this assignment. This is just to get you started!
The requirements depend on the variables in your data. Please see the bullets below for what is required.
big_tech_stock_prices %>%
ggplot(mapping = aes(x = stock_symbol)) +
geom_bar(color = 'green', fill = 'pink') +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5, color = 'black') +
theme_bw()
big_tech_stock_prices %>%
ggplot(mapping = aes(x = open)) +
geom_histogram(bins = 51, color = 'purple', fill = 'pink',
linewidth = 0.55,
alpha = 0.33) +
theme_bw()
big_tech_stock_prices %>%
ggplot(mapping = aes(x = high)) +
geom_histogram(bins = 51, color = 'purple', fill = 'pink',
linewidth = 0.55,
alpha = 0.33) +
theme_bw()
big_tech_stock_prices %>%
ggplot(mapping = aes(x = low)) +
geom_histogram(bins = 51, color = 'purple', fill = 'pink',
linewidth = 0.55,
alpha = 0.33) +
theme_bw()
big_tech_stock_prices %>%
ggplot(mapping = aes(x = close)) +
geom_histogram(bins = 51, color = 'purple', fill = 'pink',
linewidth = 0.55,
alpha = 0.33) +
theme_bw()
big_tech_stock_prices %>%
ggplot(mapping = aes(x = adj_close)) +
geom_histogram(bins = 51, color = 'purple', fill = 'pink',
linewidth = 0.55,
alpha = 0.33) +
theme_bw()
big_tech_stock_prices %>%
ggplot(mapping = aes(x = volume)) +
geom_histogram(bins = 51, color = 'purple', fill = 'pink',
linewidth = 0.55,
alpha = 0.33) +
theme_bw()
The number of bins in the histograms should be appropriate based on the sample size.
If you are having issues reading the data into R, please contact Dr. Yurko right away!